Why visualize data?
Good visualizations can:
Give powerful summaries of the underlying data
Communicate insights to audiences who often do not have the luxury of spending as much time with the data as you do.
As a data analyst or data scientist, it’s your responsibility to give the necessary high-level summaries or takeaways in any data visual you create.
Some Features of Good Visualizations
Clear on what they’re communicating
Well-defined axes, with the right scaling and labels
Good choice of colors and annotations (visually appealing)
Less is more
Some Features of Bad Visualizations
Cluttered: too much going on in the chart, with no clear communication goal
Truncated axes that start at non-zero values, which distorts interpretation
Poor choice of colors
Unnecessary 3D-fying
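To see why a truncated axis distorts interpretation, here is a small arithmetic sketch. The two values and the axis starting point are hypothetical, chosen only for illustration. It compares the true ratio of the two values with the ratio of the bar heights a reader would actually see.

```r
# Hypothetical values, chosen only to illustrate the effect
values <- c(a = 95, b = 100)
axis_start <- 90  # assumed truncated baseline of the y-axis

# Ratio of the underlying values
true_ratio <- values[["b"]] / values[["a"]]

# Ratio of the drawn bar heights when the axis starts at axis_start
drawn_ratio <- (values[["b"]] - axis_start) / (values[["a"]] - axis_start)

round(true_ratio, 2)  # 1.05
drawn_ratio           # 2
```

In this sketch a difference of about 5% is drawn as one bar being twice as tall as the other.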
Our data for today - Netflix Movies & TV Shows
library(tidyverse) # meta-package for data analysis in R
library(plotly) # creating interactive visualizations
library(DT) # nice table formatting
netflix <- read_csv("Data/netflix_titles.csv/netflix_titles.csv")
# head(netflix,5) %>% kbl() %>%
# kable_styling()
# Get a high-level summary of the data
summary(netflix)
## show_id type title director
## Length:8807 Length:8807 Length:8807 Length:8807
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## cast country date_added release_year
## Length:8807 Length:8807 Length:8807 Min. :1925
## Class :character Class :character Class :character 1st Qu.:2013
## Mode :character Mode :character Mode :character Median :2017
## Mean :2014
## 3rd Qu.:2019
## Max. :2021
## rating duration listed_in description
## Length:8807 Length:8807 Length:8807 Length:8807
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
# For a more detailed summary of the data
skimr::skim(netflix)
| Name | netflix |
| Number of rows | 8807 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 11 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| show_id | 0 | 1.00 | 2 | 5 | 0 | 8807 | 0 |
| type | 0 | 1.00 | 5 | 7 | 0 | 2 | 0 |
| title | 0 | 1.00 | 1 | 104 | 0 | 8807 | 0 |
| director | 2634 | 0.70 | 2 | 208 | 0 | 4528 | 0 |
| cast | 825 | 0.91 | 3 | 771 | 0 | 7692 | 0 |
| country | 831 | 0.91 | 4 | 123 | 0 | 748 | 0 |
| date_added | 10 | 1.00 | 11 | 18 | 0 | 1714 | 0 |
| rating | 4 | 1.00 | 1 | 8 | 0 | 17 | 0 |
| duration | 3 | 1.00 | 5 | 10 | 0 | 220 | 0 |
| listed_in | 0 | 1.00 | 6 | 79 | 0 | 514 | 0 |
| description | 0 | 1.00 | 61 | 248 | 0 | 8775 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| release_year | 0 | 1 | 2014.18 | 8.82 | 1925 | 2013 | 2017 | 2019 | 2021 | ▁▁▁▁▇ |
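The complete_rate column in the skim output is the share of non-missing values in each column. As a rough check, the sketch below recomputes it from the n_missing counts and row total reported above, then shows one way to get the same measure from a data frame. The `toy` data frame is made up for illustration.

```r
# Counts copied from the skim output above
n_rows <- 8807
n_missing <- c(director = 2634, cast = 825, country = 831)

# Share of non-missing values per column
complete_rate <- round(1 - n_missing / n_rows, 2)
complete_rate

# The same idea on a data frame (toy data, for illustration only)
toy <- data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, NA))
toy_rate <- colMeans(!is.na(toy))
toy_rate
```

The recomputed values round to 0.70, 0.91 and 0.91, matching the table.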
Some Bad Visualizations
Example 1
# A pie chart of ratings
pie_data <- netflix %>%
filter(type == 'Movie') %>%
group_by(rating) %>%
summarize(count = n())
pie_data %>% datatable()
# There's no simple function to create a pie chart in R using ggplot for a reason.
# pie_data %>%
# ggplot(aes(x = "", y = count, fill = rating)) +
# geom_bar(stat = "identity", width = 1) +
# coord_polar("y", start = 0) +
# theme_void()
# Set the plot size in plot_ly(); specifying width/height in layout() is deprecated
pie <- plot_ly(pie_data, labels = ~rating, values = ~count, type = 'pie',
               width = 700, height = 500)
pie <- pie %>%
  layout(title = 'Top Netflix Movie ratings',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
pie
What’s wrong with that plot?
The visualization is bad because:
It’s vague: lumping all movie ratings together does not help the audience identify what you’re trying to communicate.
There are too many rating categories. Remember, good visuals give high-level summaries (less is more).
The pie chart used here is not the best tool for comparing multiple categories.
Pie charts also make it difficult for your audience to judge the relative sizes of the slices.
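One common way to cut down the number of categories is to keep the largest few and fold the rest into a single "Other" group. The sketch below does this with dplyr on made-up counts. The `toy_counts` table and the cutoff of three categories are assumptions for illustration, not taken from the Netflix data.

```r
library(dplyr)

# Hypothetical counts, for illustration only
toy_counts <- tibble(
  rating = c("A", "B", "C", "D", "E", "F"),
  count  = c(50, 30, 10, 5, 3, 2)
)

lumped <- toy_counts %>%
  # Keep the 3 largest categories; relabel everything else as "Other"
  mutate(rating = if_else(min_rank(desc(count)) <= 3, rating, "Other")) %>%
  group_by(rating) %>%
  summarize(count = sum(count), .groups = "drop") %>%
  arrange(desc(count))

lumped
```

Six categories become four, and the total count is unchanged.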
Let’s look at another.
Example 2
# Movie Ratings over time
rating_data <- netflix %>%
select(type, release_year, rating) %>%
group_by(release_year, rating, type) %>%
summarise(frequency = n())
# Initial line plot
rating_plot_1 <- rating_data %>%
ggplot(aes(x = release_year,y = frequency,group = rating, color = rating)) +
geom_line(size = 1.5)
rating_plot_1 %>% ggplotly(width = 800, height = 500)
Examples of Good Visualizations
Example 1
# A bar chart of ratings
Movie_bar_data <- netflix %>% filter(type == 'Movie') %>% group_by(rating) %>%
summarize(count = n()) %>% arrange(desc(count)) %>% slice_head(n = 5)
bar <- Movie_bar_data %>%
ggplot(aes(x = reorder(rating, -count, sum), y = count)) +
geom_col(fill = c('#f9007a','#D6DBDF','#D6DBDF','#D6DBDF','#D6DBDF')) +
xlab('Ratings') + ylab('Frequency') +
ggtitle('Top 5 Movie Ratings on Netflix') +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 18,face = "bold"),
axis.title = element_text(size = 14)
)
bar # %>% ggplotly(width = 800, height = 500,tooltip = FALSE)
Example 2
# Improve line chart to draw out insights
rating_data <- rating_data %>%
# Interested in Movies released from the year 2000 onwards
filter(type == "Movie" & release_year >= 2000 & release_year < 2020) %>%
mutate( highlight = ifelse(rating == "TV-MA", "TV-MA", "Others"))
# Get overall growth rates over a 10 and 20 year period
min_year <- min(rating_data$release_year)
max_year <- max(rating_data$release_year)
# The number of movies with a mature rating for the minimum year, 2010, and the maximum year
rate_range <- rating_data %>%
  filter(release_year %in% c(min_year, 2010, max_year)) %>%
  filter(highlight == 'TV-MA') %>%
  arrange(release_year) # row order: min_year, 2010, max_year
# Compute 20 year and 10 year growth rates
growth_20 <- (rate_range$frequency[3]/rate_range$frequency[1] - 1) * 100
growth_10 <- round((rate_range$frequency[3]/rate_range$frequency[2] - 1) * 100,0)
# Revised line plot
rating_plot_2 <- rating_data %>%
ggplot( aes(x = release_year, y = frequency, group = rating, color = highlight)) +
geom_line(size = 1.5) +
scale_color_manual(values = c("Others" = "#D6DBDF", "TV-MA" = "#f9007a")) +
xlab("Release year") + ylab("Number of Movies") +
ggtitle('Increase In Mature Content Over The Last Decade') +
theme_minimal() +
# annotate() draws the label once, instead of once per row of the data
annotate("label", x = 2013.5, y = 300,
         label = glue::glue("Movies for mature audiences\nincreased {growth_10}% over the last decade"),
         size = 4, color = "#34495E") +
theme(
legend.position = "none",
plot.title = element_text(size = 18,face = "bold"),
axis.title = element_text(size = 14)
)
rating_plot_2 #%>% ggplotly(width = 800, height = 500)
Visit this GitHub repo for the code.